We present a novel method for detecting communities in bipartite networks. Based on an extension of the
k-clique community detection algorithm, we demonstrate how modular structure in bipartite networks presents
itself as overlapping bicliques. If bipartite information is available, the biclique community detection algorithm
retains all of the advantages of the k-clique algorithm, but avoids discarding important structural information
when performing a one-mode projection of the network. Further, the biclique community detection algorithm
provides a new level of flexibility by incorporating independent clique thresholds for each of the non-overlapping
node sets in the bipartite network.

I. INTRODUCTION

The theoretical understanding of the structure and function
of complex networks has grown rapidly during the past few
years [1, 2, 3]. One large component of the field of complex
networks is the study of community structure in networks;
for reviews see [4, 5]. Community structure describes
the property of many networks that nodes are divided into
'communities' with many intra-community links and sparse
connections between the densely connected modules. In spite
of a focused research effort, the mathematical tools developed
to describe the structure of large complex networks are
continuously being refined and redefined.

Currently, the endeavour of detecting community structure
in complex networks can be divided into two main approaches.
One main class can be labeled global methods, of
which the most notable example is the modularity introduced
by Newman and Girvan [6]; global methods regard community
detection as a global optimization problem, where the
objective function is particular to each method. Due to the
complexity of such optimization problems, the global methods are
typically stochastic in nature. The other class is local methods,
where the best known example is the k-clique method
described by Palla et al. [7, 8]; here, local structural information
is utilized to reveal the community structure of a network.
The local methods are usually deterministic.

Although widely studied in the fields of statistics and computer
science [9, 10, 11, 12], the study of bipartite networks
and their community structures has only recently been moving
into the focus of the network community. So far, all efforts
have been focused on global community detection methods.
Here we present a simple algorithm, based on
a local framework, that has considerable power, flexibility,
and accuracy.



II. BIPARTITE NETWORKS 

A bipartite network is a network with two non-overlapping
sets of nodes A and F, where all links must have one end node
belonging to each set. As is clear from the examples below,
many real world networks are naturally bipartite:

• Social Networks. The available data regarding many dif- 
ferent social networks consist of what is known as 'affiliation 
networks' . Examples of affiliation networks include the scien- 
tific collaboration network [ni HI] (where the two node 
sets consist of papers and authors, respectively), the movie-actor
network, where the network edges connect actors and
films [19], and artistic collaboration networks [18], where a
link indicates participation in a creative team. Other examples
of social networks that can be inferred from bipartite
data are the movie-recommendation network [20] that links
users to the movies they have watched, or the song-listener
network that links music listeners to the music they play on
their computer [21, 22].

• Biological Networks. Many important types of biological
networks are naturally bipartite. Examples of bipartite biological
networks are the metabolic network, where the two
types of nodes are reactions and metabolites [23], the human
disease network of genes and diseases [24], and the network
describing drugs and their molecular targets [25].

• Information Networks. The bipartite structure is also very
common for information networks. The generic example is
a word-document network, where one type of node is documents
(web-pages, emails, dictionary entries, etc.) that link to
the words they contain [26, 27, 28, 29].

Most of the studies of real world networks listed above
do not analyze the bipartite networks directly, but rather
one-mode projections of the network. Below, we will demonstrate
how the one-mode projection of a bipartite network disregards
important network information and argue that a direct analysis
of the bipartite network is a more natural option that captures
important nuances of the network structure that are invisible
to analyses based on unipartite projections.

A bipartite network has a bipartite ($n_A \times n_F$) adjacency
matrix E, where $n_A$ and $n_F$ are the number of nodes in each set.
This matrix is constructed such that

$$
E_{ij} =
\begin{cases}
1 & \text{if there is a link between node } i \text{ and } j, \\
0 & \text{otherwise.}
\end{cases}
\tag{1}
$$


In real networks, this matrix is typically very sparse. Any
bipartite network can be transformed into two unipartite networks:
one network consisting of the $n_A$ nodes in the A set
and one network consisting of the $n_F$ nodes in the F set. These
one-mode projections are obtained by calculating the two
symmetric, weighted matrices $A_A = EE^T$ and $A_F = E^TE$.
The diagonal elements $A_{ii}$ of these matrices contain the number
of links connected to node i in the bipartite network, and
the off-diagonal elements $A_{ij}$ contain the number
of nodes in the complementary set that are shared by nodes i
and j.
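
The projection step can be illustrated with a minimal sketch in plain Python. The function name `one_mode_projections` and the three-by-two toy matrix are assumptions made for illustration only; they do not come from the data sets analyzed in this paper.

```python
def one_mode_projections(E):
    """Return (E E^T, E^T E) for a 0/1 matrix E given as a list of rows."""
    n_a, n_f = len(E), len(E[0])
    # Projection onto the A nodes: entry (i, j) counts F nodes shared by rows i and j.
    A_A = [[sum(E[i][k] * E[j][k] for k in range(n_f)) for j in range(n_a)]
           for i in range(n_a)]
    # Projection onto the F nodes: entry (i, j) counts A nodes shared by columns i and j.
    A_F = [[sum(E[k][i] * E[k][j] for k in range(n_a)) for j in range(n_f)]
           for i in range(n_f)]
    return A_A, A_F


# Hypothetical toy network: A nodes a, b, c (rows) and F nodes 1, 2 (columns).
E = [
    [1, 0],  # a links to 1
    [1, 1],  # b links to 1 and 2
    [0, 1],  # c links to 2
]
A_A, A_F = one_mode_projections(E)
print(A_A)  # diagonal: degrees of a, b, c; off-diagonal: number of shared F nodes
print(A_F)  # diagonal: degrees of 1, 2; off-diagonal: number of shared A nodes
```

For real, sparse networks a sparse matrix library would be the natural choice; the nested lists here only keep the sketch dependency-free.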

The conceptual simplicity of the one-mode projection 
comes at a high cost. First of all, the procedure typically erad- 
icates the sparsity of the E matrix; this is especially problem- 
atic, when constructing the adjacency matrix for the smaller 
set of nodes, in the case where one of the node sets is sig- 
nificantly larger than the other. Secondly, much of the infor- 
mation present in the bipartite state becomes encoded in the 
weights of the adjacency matrix. However, due to (1) techni- 
cal difficulties regarding the analysis of weighted matrices (s^ 
and (2) the high link-density of the one-mode projections (if 
the adjacency matrix is densely populated, all nodes are con- 
nected and the network has very little structure), these matri- 
ces are usually thresholded such that only entries higher than 
some threshold are retained. Similarly, the diagonal of the 
one-mode adjacency matrices is usually set to zero, since self- 
links are not of interest in the subsequent network analysis. 

One aspect that is rarely discussed in the literature is the fact
that even if we keep all the off-diagonal weights in the
one-mode adjacency matrix, essential information is lost when
performing the one-mode projection. This is clear from the
fact that we cannot reconstruct E from $A_A$ and $A_F$. It is,
however, instructive to study precisely what information is lost.
Specifically, the problem is that the one-mode adjacency
matrices only contain two-point correlations. Given two nodes, i
and j, in one of the sets, the corresponding adjacency matrix
informs us how many nodes these two share in the
complementary set. Given a third node k, we also know the number
of nodes that are shared by i and k or j and k, respectively,
in the complementary set, but we have no information about
which nodes from the complementary set i, j, and k
connect to in common: the same set of nodes could be shared
by i, j, and k, or the nodes in the complementary set could be
shared pairwise, but not among all three. A practical example
of this problem is shown in Figure 1.

In Figure 1 we display three simple bipartite networks. The
network described in Figure 1(a) shows a case where all A
nodes are linked to a single node in the F set. A practical
example of this motif can be found in the movie-actor network,
where this would be the case when four individuals act together
in a single film. In Figure 1(b) a different network is
displayed. Here, all four nodes in the A set are interconnected
via pairwise links to six distinct nodes in the F set. In the
movie-actor network this corresponds to four actors who have
all been in films together, but with precisely two common
actors per film; these six movies could be far apart in time and
space. Therefore the significance of this network motif is very
different from the significance of the motif displayed in
Figure 1(a). Finally, the network in Figure 1(c) lies somewhere
in between the two other cases.

FIG. 1: Color online: Three distinct bipartite networks that result
in identical one-mode projections. In the first case, (a), the nodes
A = {a, b, c, d} share the single node F = {1} in the complementary
set. (b) The second case, which includes the nodes A = {a, b, c, d}
and the complementary nodes F = {1, 2, 3, 4, 5, 6}, has every pair of
nodes in the A level linked via different nodes in the complementary
set. In the third case, (c), three of the four A = {a, b, c, d} nodes,
{a, b, d}, share a single node in the complementary node set, while
all other linkages between A-nodes in this network are pairwise and
run via nodes in the complementary set that are exclusive to the two
nodes linked.

Important qualifying information about the nodes shared in
the complementary set is not carried over in the one-mode
projection of the network. When we perform the one-mode
projection of each of these three networks onto the A nodes
(we retain the weights but remove the diagonals), the
one-mode adjacency matrices become

$$
A_A^{(a)} = A_A^{(b)} = A_A^{(c)} =
\begin{pmatrix}
0 & 1 & 1 & 1 \\
1 & 0 & 1 & 1 \\
1 & 1 & 0 & 1 \\
1 & 1 & 1 & 0
\end{pmatrix}.
$$

In the one-mode projection the three networks become
indistinguishable 4-cliques.
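
The loss of information can be checked directly. The sketch below encodes the three networks of Figure 1 as 0/1 matrices (rows are the A nodes a, b, c, d; columns are the F nodes) and projects each onto the A nodes. The column ordering of the F nodes and the helper name `project_onto_A` are assumptions chosen for illustration.

```python
def project_onto_A(E):
    """Weighted projection E E^T onto the row (A) nodes, with the diagonal set to zero."""
    n_a, n_f = len(E), len(E[0])
    return [[0 if i == j else sum(E[i][k] * E[j][k] for k in range(n_f))
             for j in range(n_a)]
            for i in range(n_a)]


# (a) all four A nodes share the single F node 1
E_a = [[1], [1], [1], [1]]

# (b) each of the six pairs (ab, ac, ad, bc, bd, cd) is linked via its own F node
E_b = [
    [1, 1, 1, 0, 0, 0],  # a
    [1, 0, 0, 1, 1, 0],  # b
    [0, 1, 0, 1, 0, 1],  # c
    [0, 0, 1, 0, 1, 1],  # d
]

# (c) a, b, d share one F node; the pairs ac, bc, cd use one exclusive F node each
E_c = [
    [1, 1, 0, 0],  # a
    [1, 0, 1, 0],  # b
    [0, 1, 1, 1],  # c
    [1, 0, 0, 1],  # d
]

projections = [project_onto_A(E) for E in (E_a, E_b, E_c)]
# All three projections coincide even though the bipartite networks differ.
print(projections[0] == projections[1] == projections[2])
```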

In summary, the one-mode projection approach disregards 
important network information in two distinct steps. Firstly, 
when the projection itself is performed, all (sparse) informa- 
tion about the bipartite linkages is reduced to a dense network 
of two-point correlations. Secondly, all of the information 
contained in the weights is typically discarded in a subsequent 
thresholding operation. In the following section, we will ex- 
plain a simple way of analyzing the community structure of 
the bipartite network directly. 



III. BICLIQUE COMMUNITIES 

In analogy with the unipartite case, the basic observation
on which our community definition relies is that a typical
community consists of several complete sub-bigraphs [40] that
tend to share many of their nodes. A number of complete
bipartite graphs are displayed in Figure 2. We now define a $K_{a,b}$
clique as a complete subgraph with a nodes in the A node set
and b nodes in the F node set. A $K_{a,b}$ clique can be identical
to a maximal complete subgraph or it can exist on a subset
of the nodes of a maximal complete subgraph. Generalizing
from [7], we now define a $K_{a,b}$ clique community as the union
of all $K_{a,b}$ cliques that can be reached from each other through
a series of adjacent $K_{a,b}$ cliques. We define two $K_{a,b}$ cliques
to be adjacent if their overlap is at least a $K_{a-1,b-1}$ biclique.

FIG. 2: Color online: Maximally connected bigraphs. The notation
$K_{a,b}$ means that the complete bigraph consists of a of the black nodes
in the A set and b of the larger nodes in the F set.

FIG. 3: Color online: Biclique adjacency. Two $K_{a,b}$ cliques are
adjacent if they share at least a $K_{a-1,b-1}$ clique. In this figure we list a few
examples. The two adjacent $K_{1,2}$ cliques share a $K_{0,1}$ biclique, the
two adjacent $K_{1,3}$ cliques share a $K_{0,2}$ clique, the two adjacent $K_{2,2}$
cliques overlap by a $K_{1,1}$ clique, and the two adjacent $K_{2,3}$ cliques
share a $K_{1,2}$ clique.

Another way of saying this is that the two cliques must share
at least a - 1 upper vertices and b - 1 lower vertices. See
Figure 3.
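
The adjacency criterion can be written as a short predicate. In this sketch a biclique is represented as a pair of Python sets (A nodes, F nodes); that representation and the function name `are_adjacent` are assumptions made for illustration.

```python
def are_adjacent(clique_1, clique_2, a, b):
    """Two K_{a,b} cliques are adjacent if they share at least
    a - 1 nodes in the A set and b - 1 nodes in the F set."""
    (A1, F1), (A2, F2) = clique_1, clique_2
    return len(A1 & A2) >= a - 1 and len(F1 & F2) >= b - 1


# Two K_{2,2} cliques that overlap by a K_{1,1} clique (node b and node 2).
k22_first = ({"a", "b"}, {1, 2})
k22_second = ({"b", "c"}, {2, 3})
# A K_{2,2} clique that shares F nodes with k22_first but no A node.
k22_third = ({"c", "d"}, {1, 2})

print(are_adjacent(k22_first, k22_second, 2, 2))  # adjacent
print(are_adjacent(k22_first, k22_third, 2, 2))   # not adjacent: no shared A node

# For K_{1,2} cliques the required overlap is K_{0,1}: one shared F node suffices.
print(are_adjacent(({"a"}, {1, 2}), ({"b"}, {2, 3}), 1, 2))
```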

An important feature of the biclique community approach is
that it provides an immediate context to the
communities that are detected. In the movie-actor network,
a list of actors is always accompanied by a list of films. It is
immediately clear why the actors in a group belong together:
we know the oeuvre that they share. In the metabolic network
a list of metabolites is accompanied by a list of the reactions they
participate in; this presence of context is an important help
in determining the function of detected communities. In this
sense, the bi-community information is more valuable than the
one obtained by finding structure in the two unipartite
projections because it provides specific links between the
communities that are present in the two node sets; we will
discuss precisely what we mean by this in the next section. The
biclique method described here is related to co-clustering.

IV. RELATION TO k-CLIQUE COMMUNITIES

When bipartite network information is available, the
biclique community detection method is an attractive alternative
to the k-clique algorithm. The k-clique algorithm is
unable to analyze sparse network regions. This is due to the fact
that 2-clique communities are simply the connected components
of the network and contain little information about
the network structure. The first non-trivial k-clique has size
k = 3. These two facts combined result in the inability to
analyze sparse network regions, simply because nodes must
have at least two links in order to qualify for participation in a
3-clique. In networks with heavy tailed degree distributions,
a large fraction of the nodes have degree less than two and an
even larger fraction of nodes do not participate in cliques of
size three or greater.

If bipartite data is available, the biclique method is able to
detect subtle structures. In order to understand why this is
the case, it is useful to consider the relation between the two
methods. We begin by revisiting Figure 1. In terms of cliques,
Figure 1(a) corresponds to a $K_{4,1}$ clique, exemplified by four
actors who are part of the same movie. Figure 1(b) corresponds to
six adjacent $K_{2,1}$ cliques joined in a $K_{2,1}$ community. Finally,
Figure 1(c) can be recognized as one $K_{3,1}$ clique and three
$K_{2,1}$ cliques. When considering the community structure of the
small network in Figure 1(c), all nodes are included if we
set the threshold at $K_{2,1}$, but we only include the nodes A =
{a, b, c} and F = {1} if we raise the threshold and look for
$K_{3,1}$ communities. In this small example, we use the biclique
technique to look 'inside' the 4-clique that arises when we
project the small bipartite networks onto the A nodes.

The biclique communities have clear translations in terms
of the two un-thresholded one-mode projections. The $K_{2,1}$
communities correspond to connected components in the
projection onto the A nodes; the two F nodes, one in each of the cliques
$K_{2,1}^{(1)}$ and $K_{2,1}^{(2)}$, are linked in the one-mode projection onto the
set of F nodes if the two cliques share a $K_{1,0}$ clique, that is, if
the two cliques are adjacent. Similarly, the F nodes in each of
the two cliques $K_{2,1}^{(2)}$ and $K_{2,1}^{(3)}$ are also linked in the one-mode
projection onto the F network if they share a $K_{1,0}$ clique. Thus
the community of the three adjacent $K_{2,1}$ cliques corresponds
to a connected set of nodes (a 2-clique community) in the
network of F nodes. This small example is easily generalized to
the case of n adjacent $K_{2,1}$ cliques. A similar argument shows
that $K_{1,2}$ communities correspond to connected components in
the F networks. What is particularly noteworthy here is that
from the bipartite community detection algorithm we get, in
addition to the connected components, a list of nodes
in the complementary set of nodes that correspond to the
connected components. These nodes do not necessarily form a
connected component in the complementary one-mode
projection.

The result mentioned in the previous paragraph is readily
generalized. In fact, $K_{a,1}$ and $K_{1,b}$ biclique communities
correspond to a- and b-clique communities in the projections onto
A and F nodes, respectively. A clique $K_{a,1}^{(1)}$ results in an a-clique
(consider Figure 1(a)); another clique $K_{a,1}^{(2)}$ results in another
a-clique. Now, if these two share a $K_{a-1,0}$ clique (e.g., in the
movie-actor network, this would correspond to sharing a - 1
actors), then these a - 1 nodes are fully connected and
therefore form an (a - 1)-clique. In other words, this corresponds to an
a-clique community in the A one-mode projection. This result
can be generalized to the case of $K_{1,b}$ biclique communities
and b-clique communities in the F projection. As is clearly
illustrated in the examples displayed in Figure 1, this result is
not valid going from the one-mode projection to the bipartite
case.

In general, the biclique communities have the following
relationship to the one-mode projections: A $K_{a,b}$ community
corresponds to

1. An a-clique community D in the projection onto the A
nodes.

2. A b-clique community G in the projection onto the F
nodes.

3. Further, in order to qualify for membership in the
community D, a node must connect to a node in G and vice
versa.

This is precisely why the biclique algorithm presented here
is able to detect structures between 2-clique communities and
3-clique communities where the k-clique algorithm fails to
locate structure. The $K_{2,2}$ clique communities, for example, are
simply connected components in each one-mode projection,
with the additional constraint that the connected component
in each projection must be correlated with the complementary
connected component as described in item 3 above. This is
the precise content of the argument in the previous section that
the biclique algorithm provides context to the communities.
We emphasize that all of the arguments presented here apply
to the un-thresholded version of the one-mode projections;
thresholding the one-mode projections enhances the
advantages of the biclique community detection method.

The $K_{a,b}$ clique community method possesses the
advantages of the k-clique algorithm. The most important strength
of the k-clique method is that distinct communities can
overlap by sharing their nodes. This ability is essential when
analyzing many real networks. Consider social networks: in
social networks, most actors participate in many communities of
family, friends, and work relations. The biclique algorithm
presented here allows the same type of overlap: nodes in the
A set can overlap with other A nodes and similarly for the F
set. Cases where there is overlap between nodes from both
sets of nodes are particularly interesting. As is the case with
the k-clique algorithm, the node overlap allows the user to
zoom out and observe the network of communities, linked by
common nodes.

Another well known advantage of the k-clique method is
that it allows the user to change the resolution at which the
network is observed, by adjusting the clique size k. A high
value of k allows the user to observe structures in the denser
regions of the graph, whereas low values of k allow the user
to study the structure of the sparser regions of the network. In
the case of the $K_{a,b}$ cliques, this ability is enhanced because
we are able to vary the sizes of a and b independently of each
other. As an example, consider the movie-actor network. We
can search for groups of actors that have acted as ensembles
by choosing a to be low and b to be high, or we can search for
a series of films that share a small group of actors by choosing a
high a and a small b. By varying a and b, we can systematically
probe different aspects of the community formations by
studying the size distributions of communities and by visual
inspection LL,30J. Section IVll elaborates on this point. 



V. DETECTING BICLIQUE COMMUNITIES 

The biclique communities are detected by a procedure
analogous to the one presented for k-clique detection in [7];
however, some of the steps in the detection algorithm are different.
We describe the algorithm for detecting $K_{a,b}$ communities
in the following.

Enumerate Maximal Bicliques: To find the biclique
communities, we begin by isolating the N maximal bicliques in the
bipartite network under study. We use a freely available
algorithm, LCM (Linear time Closed itemset Miner) version 4.0
[31] (downloaded from [32]), for this purpose. Using the list
of maximal bicliques, we construct two (N x N) symmetric
clique-overlap matrices $L_A$ and $L_F$. The matrix elements of $L_A$
contain information about the clique overlap among the nodes
in the A set. Along the diagonal, this matrix contains the
number of A-nodes in maximal biclique i. The off-diagonal matrix
elements contain the number of A nodes that maximal biclique
i and maximal biclique j have in common. The matrix $L_F$ is
similar but describes the overlap amongst the F-nodes.
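
As an illustration of this step, the sketch below enumerates the maximal bicliques of a toy network by brute force and builds the two overlap matrices. It is not the LCM algorithm used here: the brute-force enumerator is exponential in the number of A nodes and is only meant for very small examples. The function names and the toy network are assumptions made for illustration.

```python
from itertools import combinations


def maximal_bicliques(neighbors):
    """Brute-force enumeration of maximal bicliques (NOT the LCM algorithm).

    neighbors maps each A node to the set of F nodes it links to.
    Returns a sorted list of (A_nodes, F_nodes) pairs of frozensets.
    """
    a_nodes = sorted(neighbors)
    found = set()
    for r in range(1, len(a_nodes) + 1):
        for subset in combinations(a_nodes, r):
            common = set.intersection(*(neighbors[u] for u in subset))
            if not common:
                continue
            # Closure: every A node that links to all of the common F nodes.
            closure = frozenset(u for u in a_nodes if common <= neighbors[u])
            found.add((closure, frozenset(common)))
    return sorted(found, key=lambda bc: (sorted(bc[0]), sorted(bc[1])))


def overlap_matrices(bicliques):
    """Return (L_A, L_F): numbers of shared A nodes and F nodes for each pair."""
    n = len(bicliques)
    L_A = [[len(bicliques[i][0] & bicliques[j][0]) for j in range(n)] for i in range(n)]
    L_F = [[len(bicliques[i][1] & bicliques[j][1]) for j in range(n)] for i in range(n)]
    return L_A, L_F


# Hypothetical toy network: actors a, b, c and films 1, 2, 3.
neighbors = {"a": {1, 2}, "b": {1, 2, 3}, "c": {2, 3}}
bicliques = maximal_bicliques(neighbors)
L_A, L_F = overlap_matrices(bicliques)
for biclique in bicliques:
    print(sorted(biclique[0]), sorted(biclique[1]))
print(L_A)
print(L_F)
```

The diagonal of each matrix holds the size of the corresponding node set of every maximal biclique, which is what the thresholding step inspects first.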

Threshold overlap matrices: The thresholding procedure
goes through several steps. The first step evaluates the diagonal
elements of $L_A$. Diagonal elements greater than or equal to a are
set to one, all other diagonal elements are set to zero. We then
threshold the off-diagonal elements; this step is slightly more
involved than the corresponding step in the k-clique algorithm.
First we set all elements of columns and rows that correspond
to a zero diagonal element to zero. Next, we threshold the
remaining elements, keeping only elements greater than or
equal to a - 1. We carry out the same procedure for matrix
$L_F$, using b in the place of a in the instructions above. Each of
the thresholded overlap matrices (let us call them $L'_A$ and $L'_F$)
now contains information about the overlap in each of the two
sets of nodes. In order to find the $K_{a,b}$ clique community
information, we now create the final total overlap matrix
L by only accepting the clique overlap when it is present in
both of the individual matrices, so we set $L = L'_A \wedge L'_F$, where
$\wedge$ is the logical operation AND. The total clique overlap
matrix, L, informs us about which maximal bicliques are adjacent
in the $K_{a-1,b-1}$ sense.
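
The thresholding rules can be sketched as follows. The function names are hypothetical, retained entries are set to one (an assumption consistent with the logical AND that follows), and the two input matrices are overlap matrices of a small toy network with four maximal bicliques.

```python
def threshold_overlap(L, t):
    """Threshold one overlap matrix with clique-size parameter t (a or b)."""
    n = len(L)
    # A maximal biclique qualifies only if it has at least t nodes in this set.
    keep = [L[i][i] >= t for i in range(n)]
    T = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if not (keep[i] and keep[j]):
                continue  # rows and columns of disqualified bicliques stay zero
            if i == j or L[i][j] >= t - 1:
                T[i][j] = 1
    return T


def total_overlap(L_A, L_F, a, b):
    """Element-wise logical AND of the two thresholded overlap matrices."""
    T_A = threshold_overlap(L_A, a)
    T_F = threshold_overlap(L_F, b)
    n = len(T_A)
    return [[T_A[i][j] & T_F[i][j] for j in range(n)] for i in range(n)]


# Toy overlap matrices for four maximal bicliques (assumed input).
L_A = [[2, 2, 1, 1], [2, 3, 1, 2], [1, 1, 1, 1], [1, 2, 1, 2]]
L_F = [[2, 1, 2, 1], [1, 1, 1, 1], [2, 1, 3, 2], [1, 1, 2, 2]]

L = total_overlap(L_A, L_F, a=2, b=2)
for row in L:
    print(row)
```

With a = b = 2 only the first and last bicliques are large enough in both node sets, and they overlap by at least a $K_{1,1}$ clique, so they are the only pair marked as adjacent.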

Find connected components: The final step is to determine 
the connected components of L; each component corresponds 
to a biclique community. From the maximal bicliques that are 
members of each community, we extract the indices of nodes 
that participate in each biclique community. 
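
The final step can be sketched with a depth-first search over the total overlap matrix. The function name, the set-pair representation of bicliques and the toy input are assumptions for illustration; any standard connected-components routine could be substituted.

```python
def biclique_communities(L, bicliques):
    """Group maximal bicliques by connected components of L and
    return the union of their node sets as (A_nodes, F_nodes) pairs."""
    n = len(L)
    seen = set()
    communities = []
    for start in range(n):
        # Bicliques with a zero diagonal entry did not pass the size thresholds.
        if start in seen or not L[start][start]:
            continue
        stack = [start]
        seen.add(start)
        A_nodes, F_nodes = set(), set()
        while stack:
            i = stack.pop()
            A_nodes |= bicliques[i][0]
            F_nodes |= bicliques[i][1]
            for j in range(n):
                if L[i][j] and j not in seen:
                    seen.add(j)
                    stack.append(j)
        communities.append((A_nodes, F_nodes))
    return communities


# Toy input (assumed): four maximal bicliques and their total overlap matrix
# for a = b = 2, in which only bicliques 0 and 3 qualify and are adjacent.
bicliques = [
    ({"a", "b"}, {1, 2}),
    ({"a", "b", "c"}, {2}),
    ({"b"}, {1, 2, 3}),
    ({"b", "c"}, {2, 3}),
]
L = [[1, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 0], [1, 0, 0, 1]]

communities = biclique_communities(L, bicliques)
for A_nodes, F_nodes in communities:
    print(sorted(A_nodes), sorted(F_nodes))
```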
FIG. 4: Color online: In many sparse real world networks, the
number of maximal bicliques grows linearly with the size of the input
database. Panel (a) shows the number of maximal bicliques found for the
IMDb [19] and cond-mat [33] networks as a function of the size of the
networks (measured in number of edges). The solid line labeled by
circles shows the number of bicliques found in the IMDb data and
the dashed line labeled by circles is the number of bicliques in the
randomized version of the same network; the lines labeled by
triangles show the same quantities for the cond-mat network. The
network was randomized using a bipartite version of the algorithm
suggested in [34]. There are significant differences between the real and
randomized data sets in the IMDb data, whereas there is little change
for the cond-mat data. These differences are mainly due to the fact
that, on average, there are more actors involved in the production
of movies than there are authors of scientific papers. A forthcoming
paper discusses the subject of biclique motifs in various bipartite
networks. Panel (b) shows the growth of the bipartite adjacency
matrix as a function of the number of edges included in the analysis;
the solid line marked by squares is the number of distinct movies and
the dashed line marked by squares is the number of actors; the lines
labeled by triangles display the number of authors (solid line) and
the number of papers (dashed line) for the cond-mat network. The
incremental growth of the number of movies in the IMDb network is
explained in the main text.



VI. NETWORK OF COMMUNITIES 

It is possible to construct a network consisting of the
biclique communities. In this network, each community is a
node and two communities are linked if they have nodes in
common. Nodes from each partition of the network are
allowed to overlap, so the network has two types of links
(A-links and F-links); the number of overlapping nodes can be
encoded as the link weight. Since the communities have
different sizes and we would like to be able to easily access
this information, we scale the node size according to the
total number of members of each community. The final piece
of information is the ratio of A-nodes to F-nodes, which we
can display by coloring the node (e.g., as a pie chart). Figure 5
displays a number of such networks of communities for the
cond-mat network.
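
A minimal sketch of this construction is given below. It is not the BCFinder implementation; the function name, the dictionary keys and the toy communities are assumptions made for illustration.

```python
from itertools import combinations


def community_network(communities):
    """Build the network of communities.

    communities is a list of (A_nodes, F_nodes) pairs of sets. Each community
    becomes a node carrying its total size and its fraction of A nodes; two
    communities are linked by an "A" link and/or an "F" link whose weight is
    the number of shared nodes of that type.
    """
    nodes = [
        {"size": len(A) + len(F), "a_fraction": len(A) / (len(A) + len(F))}
        for A, F in communities
    ]
    links = []
    for (i, (A1, F1)), (j, (A2, F2)) in combinations(enumerate(communities), 2):
        shared_a, shared_f = len(A1 & A2), len(F1 & F2)
        if shared_a:
            links.append((i, j, "A", shared_a))
        if shared_f:
            links.append((i, j, "F", shared_f))
    return nodes, links


# Hypothetical communities of authors (A set) and papers (F set).
communities = [
    ({"a", "b"}, {"p1", "p2", "p3"}),
    ({"b", "c"}, {"p3", "p4"}),
    ({"d"}, {"p5"}),
]
nodes, links = community_network(communities)
print(nodes)
print(links)
```

In this toy example the first two communities share one author and one paper, so they are joined by both link types, while the third community remains an isolated node.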

Let us think about the expected behavior of the network of
communities. In the case of $K_{1,1}$ (cf. Figure 5(a)), the network
of communities is simply one large node displaying the
fraction of A and F nodes. When we increase a and b in $K_{a,b}$, this
node breaks apart into smaller pieces. If the network is highly
modular, the resulting network of communities will be quite
sparse and many nodes will have degree zero; if the network
is homogeneous, we find a densely interconnected network of
communities. For a given choice of a and b, the structure of
the resulting network of communities provides a useful
estimate of the information content of the individual
communities.

The network of communities illustrates what aspects of
community structure we are probing when we adjust the
values of a and b. This is illustrated in Figure 5. Panel (a) shows
the network of communities for $K_{1,2}$. Displayed here is the
connected component in the paper-network and the pie chart
shows the fractions of authors and papers in the network.

Figure 5(b) shows the network of communities based on
$K_{8,2}$ cliques. The emphasis here is on a large number
of shared authors, and as a consequence, each community
is dominated by authors. The vast majority of links are
dark red, indicating author-overlap between the communities.
Figure 5(c) shows the network of communities for $K_{3,5}$. Here
the ratio of authors to papers in each community mirrors the
global ratio, and all of the communities are of similar size.
In this case the node overlap is equally distributed between
author- and paper-overlap. The typical link weight in this
network is zero or one. See Figure 6 for a detailed discussion of
two $K_{3,5}$ communities. Finally, panel (d) shows the network of
communities for $K_{2,12}$. In this case, the emphasis is on many
shared papers, so all communities contain many more papers
than authors (they are mostly light green). Similarly, the
majority of links are paper-links; the typical weight is small,
between zero and two, but a few heavy links also exist. This
threshold probes a completely different aspect of the bipartite
network than the $K_{8,2}$ communities.

The networks in Figure 5 reveal how to analyze the
network. If we wish to detect groups of longtime collaborators,
we choose small a and large b; in this case each community
contains only a few authors and many papers, while the
overlap with other communities of other longtime collaborators
will mainly be papers. The largest community in Figure 5(b)
has 12 authors and 290 papers, but such a large collaboration
is the exception rather than the rule; most communities
contain longstanding theoretical collaborations among 2 to 4
authors who have written between 20 and 60 papers together. If
we wish to search for large collaborations, we choose large a
and small b. This allows us to find communities of large
(typically experimental) collaborations; in this case the
communities contain many authors and few papers, while the
node-overlap with other communities consists of authors. In the
middle interval, when a is around the same size as b, we find
balanced groups of medium size that overlap each other both
with papers and authors. If a network is highly modular (as is
the case for the cond-mat network), the size of the overlap is
typically very small, but in dense, more homogeneous networks,
the overlaps can constitute a significant fraction of the nodes
in each community. The considerations above are specific to
the cond-mat network, but a similar analysis can be performed
on any bipartite network.

FIG. 5: Color online: Networks of communities for the cond-mat network [33] for various choices of $K_{a,b}$. In these plots, authors are represented
by dark red and papers are represented by light green; thus author node-overlap is shown as a dark red link and paper overlap is shown as
a light green link. Panel (a) shows the network of communities for $K_{1,2}$, panel (b) shows the network of communities for $K_{8,2}$, panel (c) shows
the network for $K_{3,5}$ and panel (d) describes the case of $K_{2,12}$. See the main text for details. All panels are screen-shots from BCFinder [30].



VII. ALGORITHMIC COMPLEXITY 

The algorithm proposed above can be used to analyze large
sparse networks efficiently. In analogy with the problem of
enumerating all maximal cliques (which is a classic NP-complete
problem [35], and which must be solved to detect k-clique
communities), the problem of enumerating all maximal bicliques
is NP-complete [36]. Roughly speaking, the problem is
NP-complete because the number of maximal bicliques, N, can
grow exponentially as a function of the size of the input data.
However, as we shall see in the following, this is rarely a
problem in sparse real world networks. Modern algorithms exist
that are very efficient on sparse graphs [31, 37]. For the
algorithm that we utilize [31], the computational complexity
of this step is proportional to N in the network being analyzed
(with respect to memory usage, this algorithm is also quite
efficient: the memory usage scales linearly with the size of the
input data).

Figure IH shows how N scales linearly as a function of the 
number of edges M in two large real world networks: The 
IMDb network of actors and movies L19J and the network of 
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scientific authors pubHshing in the cond-mat section of the 
arXiv database L331 . In the case of IMDb, the data for the 
plots in Figure (H was created by beginning with the network 
of all male actors and moves in 1965 and constructing the ad- 
jacency matrices and running LCM; then the data for female 
actors in movies from 1965 was added and the procedure re- 
peated. We expanded the network gradually until it encom- 
passed all movies and all actors and actresses from 1965 to 
1980. Separating the male and female actors has the conse- 
quence that the number of movies only grows half as often as 
the number of actors — this accounts for the step-like growth 
of the black solid line in Figure (U In the case of the cond- 
mat data, a similar method was used, gradually expanding the 
adjacency matrix 1992, including subsequent years incremen- 
tally until 2006. The same procedure is applied to a random- 
ized version of each data set. Figure H] (a) shows the number 
of maximal bicliques plotted against the number of edges in the
two networks (IMDb, solid black line; cond-mat, solid grey
line). The differences between the real and randomized data
sets (IMDb randomized, dashed black line; cond-mat randomized,
dashed grey line) clearly show that there is significant
additional clique structure in the real network data. Figure 4(b)
shows how the number of nodes grows as a function of the number
of edges. In the case of IMDb, the final network contains
497386 edges connecting 163416 actors to 41917 movies.
This network contains some 372833 maximal bicliques, which
the LCM algorithm locates in 3.1 seconds on a standard
laptop with a 2.16 GHz Intel Core 2 Duo processor and
2 GB RAM. In the case of cond-mat, the final network
contains 217690 edges connecting 46622 authors to 70975
papers. This network contains some 81697 maximal bicliques,
which the LCM algorithm locates in 0.7 seconds on the
same computer.
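
The text does not state how the randomized data sets were generated. One common null model for bipartite networks, shown here purely as an assumed illustration, is degree-preserving edge swapping: two edges (u1, v1) and (u2, v2) are replaced by (u1, v2) and (u2, v1) whenever this creates no duplicate edge. The function name `randomize_bipartite` and its parameters are hypothetical.

```python
import random


def randomize_bipartite(edges, n_swaps, seed=0):
    """Degree-preserving randomization of a bipartite edge list.

    edges is an iterable of (left, right) pairs. This is an assumed null
    model, not necessarily the one used for the figures in the text.
    """
    rng = random.Random(seed)
    edge_list = list(set(edges))
    present = set(edge_list)
    if len(edge_list) < 2:
        return edge_list
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        (u1, v1), (u2, v2) = edge_list[i], edge_list[j]
        new1, new2 = (u1, v2), (u2, v1)
        # Reject swaps that would duplicate an existing edge; this also
        # rejects pairs sharing an endpoint, which would change nothing.
        if new1 in present or new2 in present:
            continue
        present -= {edge_list[i], edge_list[j]}
        present |= {new1, new2}
        edge_list[i], edge_list[j] = new1, new2
    return edge_list


edges = [("a1", "p1"), ("a1", "p2"), ("a2", "p2"),
         ("a2", "p3"), ("a3", "p3"), ("a3", "p4")]
shuffled = randomize_bipartite(edges, n_swaps=100, seed=1)
```

Every node keeps its degree, so any difference in the number of maximal bicliques between the original and the shuffled edge list reflects structure beyond the degree sequence.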

Creating the overlap matrices and thresholding is
$O(N^2) \propto O(M^2)$, where M is the number of edges in the
network. Finding connected components in the overlap matrices
can be done in $O(N + M_L)$, where $M_L$ is the number of edges
in the overlap matrix L, and since this matrix is also sparse
we have $O(N)$ for the connected components. These steps
are the algorithmic bottleneck; the processing time is a
little over 30 minutes for the cond-mat network on the
hardware mentioned above. The total complexity of the algorithm
is $O(N^2) \propto O(M^2)$ [41]. Since the process of finding the
bicliques is rather involved, we have created a tool (BCFinder
[30]) that is able to automatically detect and display biclique
communities. BCFinder may be freely downloaded.

VIII. DISCUSSION 

We have presented a novel method for detecting communities
in bipartite networks. Our method is based on an extension
of the k-clique community detection algorithm suggested
by Palla et al. [7], and explains the relation between the
biclique communities and the communities in the corresponding
unipartite graph. If bipartite information is available, the
algorithm retains all of the advantages of the k-clique algorithm
(overlapping nodes, the ability to find the network of
communities in a given network, etc.), avoids discarding
important structural information when projecting the network, and
provides a new level of flexibility due to the two
thresholding parameters a and b, cf. Section VI. The biclique method
is computationally manageable for many sparse networks; in
cases where the number of bicliques scales linearly with the
number of links (as is the case for the networks analyzed
here), the algorithmic complexity scales like $O(M^2)$, where M
is the number of edges in the bipartite network.

While our purpose here is mainly to present and analyze
a new approach for detecting communities in complex
bipartite networks, it is nonetheless instructive to see a small
example of the algorithm in action. Figure 6 shows the
algorithm applied to a real network, the cond-mat network of
authors and scientific papers from 1996 to mid-2006. The top
panel shows a $K_{3,5}$-clique community of 4 authors and 11
papers; this community is a group of scientists studying
econo-physics. The bottom panel shows another $K_{3,5}$-clique
community, this time consisting of 5 authors and 13 papers. The topic
of this second community is bio-physics, more specifically
analyses of various biological time-series. A key point is that
two authors (H. E. Stanley and L. A. N. Amaral) are members
of both communities. The division into biclique communities
makes it immediately clear that it is important that
communities are allowed to overlap: there is no doubt that Stanley
and Amaral are full-fledged members of both communities.
However, we also understand why the communities are
distinct: they concern different subjects. The presence of context
(a list of authors is complemented by a list of papers and vice
versa) greatly enriches our understanding of the communities;
this information is not available from the one-mode
projections. A list of authors and papers in these two communities
can be found in the Appendix.

We expect that the biclique community detection algorithm 
will be of practical importance in all areas where the net- 
works studied are bipartite (biological networks, affiliation 
networks, information networks). 

FIG. 6: (Color online) The biclique algorithm in action on the cond-mat network [33] (years: 1996-2006). The top panel shows a $K_{3,5}$-
clique community of 4 authors and 11 papers; this community is a group of scientists studying econo-physics. The bottom panel shows
another $K_{3,5}$-clique community, this time consisting of 5 authors and 13 papers. The topic of this second community is bio-physics, more
specifically analyses of various biological time-series. A key point is that two authors (H. E. Stanley and L. A. N. Amaral) are members of
both communities. The division into biclique communities immediately underlines the importance of node overlap: there is no doubt
that Stanley and Amaral are full members of both communities. However, it is also immediately clear why the communities are distinct: they
concern different subjects. The presence of context (a list of authors is complemented by a list of papers and vice versa) greatly enriches our
understanding of the communities. A list of authors and papers in these two communities can be found in the Appendix. Both panels are
screenshots from BCFinder [30].



Authors           Papers

H.E. Stanley      On the Origin of Power-Law Fluctuations in Stock Prices
P. Gopikrishnan   Quantifying Stock Price Response to Demand Fluctuations
V. Plerou         Symmetry Breaking in Stock Demand
L.A.N. Amaral     Inverse Cubic Law for the Probability Distribution of Stock Price Variations
                  Universal and non-universal properties of cross-correlations in financial time series
                  A Random Matrix Approach to Cross-Correlations in Financial Data
                  Scaling of the distribution of fluctuations of financial market indices
                  Economic Fluctuations and Diffusion
                  Identifying Business Sectors from Stock Price Fluctuations
                  Statistical Properties of Share Volume Traded in Financial Markets
                  Ivory Tower Universities and Competitive Business Firms

TABLE I: Community displayed in top panel of Figure 6.

Authors           Papers

S. Havlin         Scale Invariance in the Nonstationarity of Physiological Signals
H.E. Stanley      Noise Effects on the Complex Patterns of Abnormal Heartbeats
P.C. Ivanov       Behavioral-Independent Features of Complex Heartbeat Dynamics
A.L. Goldberger   Sleep-Wake Differences in Scaling Behavior of the Human Heartbeat: Analysis of Terrestrial and Long-Term Space Flight Data
L.A.N. Amaral     Magnitude and Sign Correlations in Heartbeat Fluctuations
                  Dynamics of Sleep-Wake Transitions During Sleep
                  Levels of Complexity in Scale-Invariant Neural Signals
                  Relation between Magnitude Series Correlations and Multifractal Spectrum Width
                  Multifractality in Human Heartbeat Dynamics
                  A Stochastic Model of Human Gait Dynamics
                  Stochastic Feedback and the Regulation of Biological Rhythms
                  Quantification of Sleep Fragmentation Through the Analysis of Sleep-Stage Transitions
                  Characterization of Sleep Stages by Correlations of Heartbeat Increments

TABLE II: Community displayed in bottom panel of Figure 6.